---
author: Hanna-Leena
title: "IODS course project"
output:
  html_document:
    theme: cosmo
    toc: true
    toc_depth: 2
    fig_caption: true
    fig_width: 6
    fig_height: 4
---


About the project

This is a difficult course for me; I have never done anything like this before. I hope to learn the basics of the R programming language. I heard about the course from my colleagues. My GitHub repository for the course is here: https://github.com/Hanna-Leena/IODS-project. This is an R Markdown (.Rmd) file, so R Markdown syntax can be used.


Regression and model validation

Description of the dataset

students2014<-read.csv("learning2014.csv", header = TRUE, sep= ",")

# Looking at the structure

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(students2014)
##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

This data has 166 observations, meaning 166 students, and 7 variables: gender, age, attitude, deep, stra, surf and points. The attitude variable measures the students’ global attitude toward statistics, Points are the exam points, and deep, stra and surf measure points related to different aspects of learning (deep, strategic and surface learning).

Show a graphical overview of the data and summaries of the variables in the data

pairs(students2014[-1], col=students2014$gender)

summary(students2014)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

This picture is not actually helping me to understand what kind of data I am processing. I like the summary table, and I actually use it quite often when getting familiar with my data. Let’s look at it more closely.


There are 110 females and 56 males. The minimum age is 17, median 22, mean 25.51 and maximum 55; the quartiles can also be used (1st quartile 21, 3rd quartile 27). Exam points: minimum 7, median 23, mean 22.72, maximum 33, 1st quartile 19, 3rd quartile 27.75. Global attitude: minimum 14, median 32, mean 31.43, maximum 50, 1st quartile 26, 3rd quartile 37. Deep learning points: minimum 1.58, median 3.667, mean 3.68, 3rd quartile 4.083. And so on. Not everybody likes these kinds of tables, but for me they give quick insight into the data, and I use them quite often.

Use the packages GGally and ggplot2 and get some help with the graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(students2014, mapping = aes(alpha=0.5), lower = list(combo = wrap("facethist", bins = 20)))

p

This way the data is more easily readable. Almost all the variables are normally distributed; age is skewed. The plot also shows that there are no very strong correlations between the variables, since the correlations lie between about −0.3 and 0.4.

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

model <- lm(Points ~ gender + Attitude + stra, data = students2014)
summary(model)
## 
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97982    2.40303   3.737 0.000258 ***
## genderM     -0.22362    0.92477  -0.242 0.809231    
## Attitude     0.35100    0.05956   5.893 2.13e-08 ***
## stra         0.89107    0.54409   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

This model studied the association of exam points (the target variable) with gender, attitude and strategic learning (the explanatory variables). It can be seen that Attitude is the only explanatory variable that is statistically significant (p-value < 0.05); in fact its p-value is very low.

Next we use a summary of the fitted model to explain the relationship between the chosen explanatory variable and the target variable.

I fit a new regression model with the one significant explanatory variable, Attitude.

model_2 <- lm(Points ~ Attitude, data = students2014)
summary(model_2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

These results mean that the estimate for attitude is 0.35 and the p-value stays below 0.05, so the effect remains statistically significant. In other words, when attitude increases by one unit, the expected exam points increase by about 0.35.

Explaining the multiple R-squared of the model.

R-squared is a statistic showing how close the data are to the fitted regression line, i.e. how well the model fits my data. In general, the higher the R-squared, the better the model fits the data. R-squared is defined as the percentage of the response variable’s variation that is explained by the linear model: R-squared = explained variation / total variation, so it is always between 0 and 100%. In this summary the multiple R-squared is 0.1906, i.e. about 19%, which means that this model explains only about a fifth of the variation in exam points around their mean. R-squared does not tell whether the coefficient estimates and predictions are biased, which is why the residual plots must also be assessed.
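To make the definition concrete, here is a small self-contained sketch (toy simulated data, not the course data) showing that 1 − SSres/SStot reproduces the multiple R-squared reported by summary():

```r
# Toy illustration of R-squared = explained variation / total variation
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)            # simulated response
fit <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)   # unexplained (residual) variation
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)  # the two values agree
```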

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

The linear regression model rests on several assumptions. With the help of these plots you can analyze the residuals of the model and see how well the linear regression model is working, or whether there are some problems with it.

plot(model_2, which = c(2,1,5))

Residuals vs Fitted values

This plot shows that the residuals have roughly constant variance: they are spread equally around the horizontal line without distinct patterns.

Q-Q plot (normality)

With the Q-Q plot you can check whether the residuals are normally distributed. As you can see, the points lie very close to the line; the upper and lower tails show some deviation, but I think this is acceptable. I would interpret the errors as normally distributed.
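As a numeric complement to the visual check (my own addition; with the course data one would pass residuals(model_2) instead), the Shapiro-Wilk test can be applied to the residuals:

```r
# Shapiro-Wilk normality test on residuals; a large p-value means
# no evidence against normality. Toy residuals shown here:
set.seed(2)
res <- rnorm(100)
shapiro.test(res)
```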

Residuals vs Leverage

This plot helps you to see whether there are outliers in the data that are influential in the linear regression model. In this analysis all the cases lie inside the Cook’s distance lines.


Week 3, Logistic Regression

Reading the data

This week we learn about logistic regression. Before analysing the data I did some wrangling; it was not as difficult as the week before. Now the data is ready to be analyzed. The following changes were made: the variables not used for joining the two datasets were combined by averaging. The variable ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’, and ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
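The averaging and thresholding steps described above can be sketched roughly like this (a hedged reconstruction on toy rows, not the actual wrangling script):

```r
library(dplyr)
# Toy rows standing in for the joined questionnaire data
alc_demo <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 1))
alc_demo <- alc_demo %>%
  mutate(alc_use  = (Dalc + Walc) / 2,   # average of workday and weekend use
         high_use = alc_use > 2)         # TRUE when average use exceeds 2
alc_demo
#   Dalc Walc alc_use high_use
# 1    1    2     1.5    FALSE
# 2    3    4     3.5     TRUE
# 3    5    1     3.0     TRUE
```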

alc <- read.csv("alc2014.csv", header = TRUE, sep = ",")

# Looking at the structure

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

The modified data consists of 382 observations of 35 variables, as seen above. The data describes student achievement in secondary education at two Portuguese schools, collected through questionnaires and school reports. Originally there were two datasets on performance in two distinct subjects: Mathematics (called mat) and Portuguese language (called por). If you want, you can find more information about the original dataset here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.

To do this I chose four variables from the data, and for each of them I present my own hypothesis about its relationship with alcohol consumption. The chosen variables are failures, absences, famrel and higher.

1. The variable “failures” gives the number of past class failures, which might be linked to higher alcohol consumption. Common sense suggests these two variables could be correlated, which is why I chose it.
2. The second chosen variable is “absences”, which measures school absences. Again, common sense suggests that those who drink more alcohol have more school absences, so I want to see whether there is a correlation.
3. The third chosen variable is “famrel”, the quality of family relationships. Lower social standing is commonly linked to higher alcohol consumption; now we will find out whether these two variables correlate in this data.
4. The fourth chosen variable is “higher”, which tells whether the student wants to pursue higher education. It can be argued that those who study more and harder have less time for partying and, for instance, alcohol use, so we will find out whether those who want a higher education consume less alcohol.

Next we are going to explore, numerically and graphically, the distributions of the chosen variables and their relationships with alcohol consumption, using for example cross-tabulations, bar plots and box plots.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

First we explore the relationship between alcohol consumption and the number of past class failures

table_1 <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)

addmargins(table_1)
##         number_of_past_class_failures
## high_use   0   1   2   3 Sum
##    FALSE 244  12  10   2 268
##    TRUE   90  12   9   3 114
##    Sum   334  24  19   5 382
failurel <- ggplot(alc, aes(x = high_use, y = failures))
failurel + geom_count()

From this table and plot you can see that those with no past class failures consume less alcohol, while those who consume more alcohol have more class failures.

Second we explore the relationship between alcohol consumption and absences

g <- ggplot(alc, aes(x = high_use, y = absences))

g2 <- g + geom_boxplot() + ggtitle("High alcohol consumption and absences")

g2

From this picture we can see that those who consume less alcohol have fewer school absences, and those who consume more alcohol have more school absences. So there might be a correlation between these two variables.

Third we explore the correlation between alcohol consumption and family relationships

g3 <- ggplot(alc, aes(x = high_use, y = famrel))

g4 <- g3 + geom_boxplot()

g4

The students who consume less alcohol rate their family relationships higher, and those who consume more alcohol rate them lower. From this you cannot say which causes which: do students start using alcohol because of bad family relationships, or do family relationships get worse when a student starts using more alcohol?

Fourth we explore the hopes for higher education and alcohol consumption

table_2 <- table(high_use = alc$high_use, wants_high_education = alc$higher)

table_2
##         wants_high_education
## high_use  no yes
##    FALSE   9 259
##    TRUE    9 105
round(prop.table(table_2) * 100, 1)
##         wants_high_education
## high_use   no  yes
##    FALSE  2.4 67.8
##    TRUE   2.4 27.5

From these two tables it looks like those who consume more alcohol are less likely to want higher education: 9 of the 114 high-use students answered no, compared with 9 of the 268 low-use students.

Next we are going to use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable as the target.

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.12562    0.74528  -0.169 0.866143    
## failures     0.43484    0.19618   2.217 0.026653 *  
## absences     0.08341    0.02272   3.671 0.000241 ***
## famrel      -0.22708    0.12502  -1.816 0.069322 .  
## higheryes   -0.36274    0.55177  -0.657 0.510924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences      famrel   higheryes 
## -0.12562333  0.43483926  0.08340567 -0.22707832 -0.36273561

As seen above, the p-value is low (<0.05) for failures and absences, and for famrel it is close to 0.05. The coefficients for failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. Family relationships seem to have a negative effect (−0.23) on high alcohol use, which suggests that better family relationships have a protective effect. Planning to pursue higher education also has a negative coefficient (−0.36), but that effect is not statistically significant.

Next we present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ readr   1.3.1     ✔ stringr 1.4.0
## ✔ tibble  2.1.3     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
OR <- coef(m) %>% exp()
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures    1.5447147 1.0499664 2.282173
## absences    1.0869827 1.0418018 1.139025
## famrel      0.7968584 0.6230747 1.019051
## higheryes   0.6957704 0.2372414 2.121797

From this table you can see the coefficients of the model as odds ratios together with their confidence intervals. The odds ratio for failures is 1.54 (confidence interval 1.05–2.28) and for absences 1.09 (confidence interval 1.04–1.14). This means that each additional past class failure multiplies the odds of high alcohol consumption by about 1.54, and each additional absence by about 1.09. The odds ratio for better family relationships is 0.80 (confidence interval 0.62–1.02), meaning that each one-point improvement in family relationships multiplies the odds of high alcohol consumption by about 0.8. The fourth tested variable, the plan to pursue higher education, was not statistically significant: its odds ratio is 0.70 with confidence interval 0.23–2.12. When 1 is included in the confidence interval, there is no statistically significant difference (note that the famrel interval also just includes 1). The findings can be said to support the earlier hypotheses.
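As a sanity check on how these odds ratios arise, exponentiating the failures coefficient from the model summary above by hand gives the same number as the OR table:

```r
# A logistic-regression coefficient is a log-odds; exponentiating it
# gives the odds ratio. Using the failures estimate printed above:
exp(0.43484)   # about 1.545, the odds ratio for one additional past class failure
```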

Exploring the predictive power of the model

We now explore the predictive power of the model. Although only failures and absences had a clearly significant relationship with high/low alcohol consumption, I keep the same four explanatory variables here.

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")

probabilities <- predict(m, type = "response")

alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
##     failures absences famrel high_use probability prediction
## 373        1        0      4    FALSE   0.2765114      FALSE
## 374        1        7      5     TRUE   0.3531843      FALSE
## 375        0        1      5    FALSE   0.1764851      FALSE
## 376        0        6      4    FALSE   0.2898242      FALSE
## 377        1        2      5    FALSE   0.2646186      FALSE
## 378        0        2      4    FALSE   0.2262058      FALSE
## 379        2        2      2    FALSE   0.5234763       TRUE
## 380        0        3      1    FALSE   0.3857482      FALSE
## 381        0        4      2     TRUE   0.3523118      FALSE
## 382        0        2      4     TRUE   0.2262058      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.25392670 0.04450262 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

g + geom_point()

The training error

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471

The training error is 0.30, i.e. 30%, meaning that almost a third of the observations are misclassified. I do not know how common this is when fitting such a model, but I think the number is quite high and the model is not working as well as it should.
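One benchmark worth noting (my own addition, using the counts from the cross-tabulation above): a trivial classifier that always predicts FALSE is wrong exactly for the share of high-use students, 114 out of 382, so the model’s training error is barely better than guessing:

```r
# Training error of the "always predict FALSE" baseline:
# it misclassifies every TRUE case, i.e. 114 of the 382 students.
114 / 382   # about 0.298, close to the model's 0.301
```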

Perform 10-fold cross validation on this model

library(boot)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

cv$delta[1]
## [1] 0.3141361

After 10-fold cross-validation the error is about the same, 0.31, so roughly 30% of observations are still incorrectly classified.

Week 4, Exercise 4, Clustering and classification

This week we learn about clustering and classification. First we need to load the Boston data from the MASS package.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

# Then we explore the structure and the dimensions of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data has 506 observations of 14 variables. This dataset is about housing values in suburbs of Boston. The variable names are abbreviated, so to understand what they mean I have explained them below. Variables: “crim” means per capita crime rate by town. “zn” means proportion of residential land zoned for lots over 25,000 sq.ft. “indus” means proportion of non-retail business acres per town. “chas” means Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). “nox” means nitrogen oxides concentration (parts per 10 million). “rm” means average number of rooms per dwelling. “age” means proportion of owner-occupied units built prior to 1940. “dis” means weighted mean of distances to five Boston employment centres. “rad” means index of accessibility to radial highways. “tax” means full-value property-tax rate per $10,000. “ptratio” means pupil-teacher ratio by town. “black” means 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town. “lstat” means lower status of the population (percent). “medv” means median value of owner-occupied homes in $1000s.

Then we explore the data with a graphical overview and summaries of the variables.

First I load a few packages that I may need during the exercise.

library(ggplot2)
library(GGally)
library(tidyverse)

Next we are going to look at the summary of the Boston dataset

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

If you look at the variables more closely, you can see that for most of them the mean and median are fairly close to each other, which suggests roughly symmetric distributions. The variables “zn” and “crim” are clearly not normally distributed, and “chas” is a binary variable.
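The mean-versus-median reading above can be made numeric; a small sketch using the same Boston data (the gap is scaled by the standard deviation so variables are comparable; values far from zero hint at skew):

```r
library(MASS)                         # for the Boston data
data("Boston")
# Standardized gap between mean and median for each variable
gap <- sapply(Boston, function(x) (mean(x) - median(x)) / sd(x))
round(sort(gap, decreasing = TRUE), 2)
# crim, zn and tax show clearly positive gaps (right skew),
# while rm sits near zero (roughly symmetric)
```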

Next we are going to look at the correlations in the dataset.

# First a picture of the graphical overview

pairs(Boston)

This picture is not very clear: in the pairs plot all the points are so small and so close to each other that you mostly see black. From this picture you cannot read the correlations, so next we look at them more closely.

# Let's make a correlation matrix and draw a correlation plot

cormatrix <- cor(Boston)
cormatrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormatrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

My R couldn’t find the corrplot library at first, even though that was instructed in DataCamp. After installing the corrplot package I could draw this plot, which is much clearer to me. There is a strong negative correlation (big red ball) between dis–nox, dis–age and dis–indus: moving further from the Boston employment centres, the nitrogen oxide concentration goes down and the proportion of owner-occupied units built prior to 1940 goes down. This seems clear and logical. Also the lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation: when the percentage of lower-status population gets bigger, the median value of owner-occupied homes (in $1000s) gets smaller. This is also logical. A positive correlation is marked with a big blue ball: rad and tax have a strong positive correlation, meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises.

Standardize the dataset and print out summaries of the scaled data

Standardizing the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# boston_scaled is a matrix, so we convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

After the data has been scaled you can see from the summary that every variable now has mean zero and the values are on a comparable scale, which will help when fitting a model. (Note that scaling changes the location and spread of a variable, not the shape of its distribution.) Next we are going to change the continuous crime rate variable into a categorical variable: we cut the crim variable at its quantiles so that the low, middle and high crime rates fall into their own categories. Then we are going to drop the old crim variable from the dataset and replace it with the new crime variable.

# Creating a quantile vector

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# creating a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

The data still has 506 observations and 14 variables. The instructions said that we need to divide the dataset into train and test sets so that 80% of the data belongs to the train set.

# We make the train and test sets: a random 80% of the observations go to the train set and the rest to the test set.
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
# ind is a plain vector, not a matrix, so dim() returns NULL; length(ind) would give its size
dim(ind)
## NULL
train <- boston_scaled[ind, ]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 0.456 ...
##  $ indus  : num  0.247 1.231 -0.376 1.015 -0.769 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -1.016 0.434 -0.299 1.254 -1.067 ...
##  $ rm     : num  0.00194 -0.58301 0.26951 -1.27329 2.81002 ...
##  $ age    : num  -0.838 0.925 1.013 1.077 -2.138 ...
##  $ dis    : num  0.336 -0.65 -0.647 -0.982 2.428 ...
##  $ rad    : num  -0.522 -0.522 -0.522 1.66 -0.293 ...
##  $ tax    : num  -0.0607 -0.0311 -0.1438 1.5294 -0.4642 ...
##  $ ptratio: num  0.113 -1.735 1.129 0.806 0.298 ...
##  $ black  : num  0.291 -0.705 0.422 0.441 0.441 ...
##  $ lstat  : num  -0.52 0.2488 -0.0536 1.1479 -1.2761 ...
##  $ medv   : num  -0.123 -0.558 -0.297 -1.2 2.204 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 3 2 4 3 3 4 1 2 3 ...

So now the train set has 404 observations and 14 variables. This is 80% of the 506 observations we began with.
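Because sample() draws a random subset, the split (and every result after it) changes each time the document is knitted. A small sketch, assuming the same scaled Boston data as above, of how set.seed() makes the split reproducible:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2017)                       # any fixed seed makes the split repeatable
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)     # size 404.8 is truncated to 404 observations
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(nrow(train), nrow(test))           # 404 and 102
```

With the seed fixed, re-knitting the document reproduces exactly the same train and test sets.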

Next we are creating the test set.

test <- boston_scaled[-ind, ]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  0.0487 -0.4872 -0.4872 -0.4872 -0.4872 ...
##  $ indus  : num  -0.476 -0.437 -0.437 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.265 -0.144 -0.144 -0.144 -0.144 ...
##  $ rm     : num  0.131 -0.478 -0.455 -0.671 0.554 ...
##  $ age    : num  0.914 -0.241 0.733 1.116 0.665 ...
##  $ dis    : num  1.212 0.433 0.103 0.143 0.211 ...
##  $ rad    : num  -0.522 -0.637 -0.637 -0.637 -0.637 ...
##  $ tax    : num  -0.577 -0.601 -0.601 -0.601 -0.601 ...
##  $ ptratio: num  -1.5 1.18 1.18 1.18 1.18 ...
##  $ black  : num  0.393 0.441 0.393 0.415 0.258 ...
##  $ lstat  : num  1.0918 -0.6152 0.1648 1.012 -0.0943 ...
##  $ medv   : num  -0.819 -0.232 -0.319 -0.873 -0.167 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 3 3 3 3 3 1 2 1 2 ...

The test set has 102 observations and 14 variables. This is 20% of the original data set.

Next we are going to fit a linear discriminant analysis on the train set

We are using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We are going to draw the LDA (bi)plot of the model.

#Fitting the linear discriminant analysis on the train set

library(MASS)   # lda() is provided by the MASS package

lda.fit <- lda(formula = crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#Target classes as numeric 

classes <- as.numeric(train$crime)

#Drawing a plot of the lda results

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

#Next we are going to save the crime categories from the test set and then remove the categorical crime variable from the test dataset. After that we are going to predict the classes with the LDA model on the test data.

#We are going to save the crime categories from the test set

correct_classes <- test$crime

library(dplyr)

#Next we are going to remove the categorical crime variable from the test dataset

test <- dplyr::select(test, -crime)

# Next we are going to predict with the LDA model on the test data

lda.pred <- predict(lda.fit, newdata = test)

#Then we were asked to do a cross table of the results

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       9        0    0
##   med_low    5      14       12    0
##   med_high   1       5       17    3
##   high       0       0        1   21

Here you can see how well the model predicts. It works well for the high crime class (21 of 22 correct), but makes more errors in the other classes, especially between the neighbouring low and medium categories.
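One simple summary of the cross table is the overall accuracy, i.e. the share of observations on the diagonal. A sketch using the counts printed above (note that because the split is random, the exact counts will differ between runs):

```r
# Confusion matrix counts copied from the cross table above
conf <- matrix(c(14,  9,  0,  0,
                  5, 14, 12,  0,
                  1,  5, 17,  3,
                  0,  0,  1, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)   # 0.647: roughly two thirds classified correctly
```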

#Next we are going to load the Boston dataset again and standardize it. We will calculate the distances between the observations, run the k-means algorithm on the dataset, investigate the optimal number of clusters, run the algorithm again with that number, and finally visualize the clusters and interpret the results.

library(MASS)
data("Boston")
dim(Boston)
## [1] 506  14

We have now loaded the Boston dataset again from the MASS library and checked that it has the correct number of observations (506) and variables (14).

Next we are going to measure the distances between the observations. I am going to use the Euclidean distance, which is probably the most common choice. K-means is an old and widely used clustering method; it computes the distances automatically, but you have to choose the number of clusters yourself. I first made the model with 3 clusters, because in my opinion it worked better than 4 clusters.

scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)

#Next we are going to calculate the Euclidean distances between the observations.

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
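As a sanity check (my own sketch, not from the course material), the values returned by dist() are plain Euclidean distances: the square root of the summed squared differences between two rows of the scaled data.

```r
library(MASS)
data("Boston")
s <- scale(Boston)                         # scaled data as a matrix

d <- dist(s)                               # dist() uses Euclidean distance by default
manual <- sqrt(sum((s[1, ] - s[2, ])^2))   # distance between rows 1 and 2 by hand

all.equal(as.matrix(d)[1, 2], manual)      # TRUE
```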
# K-means clustering

km <- kmeans(scale_Boston2, centers = 3)
#Plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next we are going to investigate the optimal number of clusters. There are many ways to find it, but we will use the total within-cluster sum of squares (WCSS) and visualise the result with a plot.

# First determine the number of clusters

k_max <- 10

# Calculate the total within sum of squares

twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# Next we are going to visualize the results

library(ggplot2)   # qplot() is provided by ggplot2
qplot(x = 1:k_max, y = twcss, geom = "line")

The optimal number of clusters is where the total WCSS drops radically. As you can see from the plot above, this happens around x = 2, so the optimal number of clusters would be 2. Next we run the algorithm again with two clusters.

km <- kmeans(scale_Boston2, centers = 2)

pairs(scale_Boston2, col = km$cluster)

In this first picture all the variables are included. Because there are so many variables, the picture is quite difficult to interpret and understand, so I chose to make two more plots to look at the effects more closely.

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)

As you can see from the pictures above, the variable "chas" does not follow any clear pattern with the other variables. These kinds of plots are still hard for me to read, since this is my very first time doing this, but there seems to be a negative correlation between indus–dis, nox–dis and dis–lstat, and a positive correlation between indus–nox, age–nox and age–lstat.


Week 5

Dimensionality reduction techniques. I had huge problems with the data wrangling and after all the tricks I did, I couldn’t get the observations and variables correct, so I am now reading the processed data that was handed out to us.

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)

Let's check that we now have the right number of observations and variables: after the data wrangling the human data should have 155 observations and 8 variables. As you can see, we now have the correct data in hand.

dim(human)
## [1] 155   8

First let’s look at the data in hand

This week we are going to learn dimensionality and reduction techniques. We will work on a data called the human-data. This dataset originates from the United Nations Development Programme and it is about measuring the development of a country with Human Development Index (HDI). More information about the data can be looked from these two webpages: http://hdr.undp.org/en/content/human-development-index-hdi and http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf.

As said before, my data wrangling did not work and I ended up with the wrong number of observations and variables, so I did the analysis with the given data.

We have read the data from the webpage and checked that it has the correct number of observations and variables. Let's look at the variable names.

colnames(human)
## [1] "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"

Next we are going to look at the summary of the data.

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

You can see now the means, the medians and the quartiles of the variables.

head(human)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297    17.5     81.6 64992       4       7.8
## Australia   0.9968288 0.8189415    20.2     82.4 42261       6      12.1
## Switzerland 0.9834369 0.8251001    15.8     83.0 56431       6       1.9
## Denmark     0.9886128 0.8840361    18.7     80.2 44025       5       5.1
## Netherlands 0.9690608 0.8286119    17.9     81.6 45435       6       6.2
## Germany     0.9927835 0.8072289    16.5     80.9 43919       7       3.8
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9

The data consists of 155 countries and 8 variables for each country. One can see the name of the country and then the variables describing the country’s development.

Empowerment:
Edu2.FM = proportion of females with at least secondary education / proportion of males with at least secondary education
Labo.FM = proportion of females in the labour force / proportion of males in the labour force
Parli.F = percentage of female representatives in parliament

Health and knowledge:
Life.Exp = life expectancy at birth
Edu.Exp = expected years of schooling
GNI = gross national income per capita
Mat.Mor = maternal mortality ratio
Ado.Birth = adolescent birth rate

#Let's look at the structure of the data

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(tidyverse)
library(dplyr)
library(ggplot2)
library(GGally)

#Let's look at a graphical overview of the data

gather(human) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

h <- ggpairs(human, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
h

Here you can see graphical overviews of the data. All the variables are numeric. Only one variable, Edu.Exp, looks approximately normally distributed.

Next we are going to study the relationships between the variables with a correlation matrix

#We are going to do a correlation matrix and draw a correlation plot to look at the relationships between the variables
library(corrplot)
library(tidyverse)

correlation <- cor(human)
correlation %>% round(digits = 2)
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00

Then the correlation plot

corrplot.mixed(correlation, lower.col = "black",  number.cex = .6)

Here you can see that the percentage of female representatives in parliament (variable Parli.F) and the ratio of females to males in the labour force (variable Labo.FM) don’t have strong correlations with any of the other variables.

The maternal mortality ratio (variable Mat.Mor) and life expectancy (variable Life.Exp) have a strong negative correlation: when the maternal mortality ratio gets higher, life expectancy gets lower. The adolescent birth rate (variable Ado.Birth) also has a strong negative correlation with life expectancy. Expected years of schooling (variable Edu.Exp) and GNI, in turn, seem to be positively associated with life expectancy.
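The strong pairs can also be picked out programmatically instead of reading the plot by eye. A hedged sketch: strong_pairs below is my own helper, not part of the course material; it could be applied to the correlation matrix computed above, and is demonstrated here on the Boston correlations so that the example is self-contained.

```r
library(MASS)
data("Boston")

# Return the variable pairs whose absolute correlation exceeds a threshold
strong_pairs <- function(cormat, threshold = 0.7) {
  cormat[lower.tri(cormat, diag = TRUE)] <- NA   # keep each pair only once
  idx <- which(abs(cormat) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cormat)[idx[, 1]],
             var2 = colnames(cormat)[idx[, 2]],
             r    = round(cormat[idx], 2))
}

strong_pairs(cor(Boston), threshold = 0.7)
```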

#Principal Component Analysis

The next step was to do a PCA (principal component analysis) on the non-standardized data. Principal Component Analysis can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). The prcomp() function uses the SVD, which is the preferred, more numerically accurate method.

#First let's make a PCA with the SVD-method

pca_human <- prcomp(human)
sum_pca_human <- summary(pca_human)
sum_pca_human
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

This is the PCA with the SVD-method.

pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), "(", pca_pr, "%)")


#Drawing the biplot
biplot(pca_human, choices = 1:2, cex = c(0.8,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of non-scaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

As you can see, almost all the variables are gathered in one corner and only one arrow is visible; the other arrows have essentially zero length and indeterminate angle, so they are skipped. This happens because PCA is sensitive to the relative scaling of the original features: it assumes that features with larger variance are more important than features with smaller variance. Without scaling the biplot therefore looks like the one above. GNI has by far the largest variance, so it dominates, as you can see.
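The scaling effect can be demonstrated with a small synthetic example (my own illustration, not the course data): when one column has a much larger variance, the first principal component of unscaled PCA aligns almost entirely with it.

```r
set.seed(42)
toy <- data.frame(small = rnorm(100, sd = 1),
                  huge  = rnorm(100, sd = 1000))   # plays the role of GNI

pca_raw <- prcomp(toy)                       # no scaling
round(abs(pca_raw$rotation[, "PC1"]), 3)     # PC1 loads almost entirely on 'huge'

pca_std <- prcomp(toy, scale. = TRUE)        # standardize first
round(abs(pca_std$rotation[, "PC1"]), 3)     # loadings now of comparable size
```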

#Next we are going to standardize the variables in the human data and repeat the above analysis

human_scaled <- scale(human)
str(human_scaled)
##  num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##   ..$ : chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
##  - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 1.32e+01 7.17e+01 1.76e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
##  - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 2.84 8.33 1.85e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...

Let’s look at the summary of the scaled data

summary(human_scaled)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

As you can see, after scaling every mean is exactly zero and the medians are close to the means.
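This can be verified directly: scale() subtracts the column mean and divides by the column standard deviation, so every scaled column has mean 0 and standard deviation 1. A small sketch, shown on the Boston data so it runs without the downloaded file, but the same holds for human_scaled:

```r
library(MASS)
data("Boston")
scaled <- scale(Boston)

round(colMeans(scaled), 10)        # all zero (up to rounding error)
round(apply(scaled, 2, sd), 10)    # all one
```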

class(human_scaled)
## [1] "matrix"
human_scaled <- as.data.frame(human_scaled)
#Let's make the PCA again with using the SVD-method

pca_human_s <- prcomp(human_scaled)

sum_pca_human_s<-summary(pca_human_s)
pca_pr_s <- round(100*sum_pca_human_s$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr_s), " (", pca_pr_s, "%)")

sum_pca_human_var_s<-sum_pca_human_s$sdev^2
sum_pca_human_var_s
## [1] 4.2883701 1.2989625 0.7657100 0.6066276 0.4381862 0.2876242 0.2106805
## [8] 0.1038390

Next let’s draw the biplot

biplot(pca_human_s, choices = 1:2, cex= c(0.5,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of scaled human data")

As you can see, now we have more than one arrow, so the standardization works. The variables are now on a comparable scale, and GNI, which had by far the largest variance, no longer “runs over” the other variables.

Edu.Exp, GNI, Edu2.FM and Life.Exp are close together and their arrows share a small angle, which means these variables are strongly positively correlated. The arrows of Mat.Mor and Ado.Birth point in the opposite direction, which means they are strongly negatively correlated with the variables mentioned earlier. All of these arrows form a large angle with Labo.FM and Parli.F, which means the correlations there are weak.

The angle between a feature and a PC axis can be understood as the correlation between the two: a small angle means a high positive correlation.
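This connection can be made exact. A sketch with simulated data (my own illustration; the property holds for any standardized dataset): if the loading arrows are scaled by the component standard deviations and all components are kept, the inner products of the arrows reproduce the correlation matrix, so the cosine of the angle between two arrows equals the correlation of the corresponding variables. The 2D biplot merely truncates this to the first two components.

```r
set.seed(1)
# Three standardized toy variables, two of them correlated through x
x <- rnorm(200)
toy <- scale(data.frame(a = x + rnorm(200),
                        b = x + rnorm(200),
                        c = rnorm(200)))

p <- prcomp(toy)
L <- p$rotation %*% diag(p$sdev)   # full loading arrows, scaled by sdev

# With all components kept, L %*% t(L) equals the correlation matrix
all.equal(unname(L %*% t(L)), unname(cor(toy)))   # TRUE
```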

The lengths of the arrows are proportional to the standard deviations of the variables, and here they look fairly similar across the different variables.

#Next we are going to do multiple correspondence analysis MCA

library(FactoMineR)

You need to install the FactoMineR package to do the MCA. From this package we will use the “tea” data for the analysis. After loading the package, let’s look at what the tea data looks like.

data(tea)
colnames(tea)
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"

Here you can see what kind of variables you have in the tea-data.

Let’s look at the summary next

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

Let’s look at the structure of the data.

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

As you can see, the tea data has 300 observations and 36 variables. Almost all of the variables are categorical, with 2 to 7 categories each; only the variable “age” is not categorical.

Next we want to get a graphical overview of the data. With MCA you can analyse the pattern of relationships between several categorical variables. MCA is designed for categorical variables, but continuous ones can be used as supplementary background variables.

For the categorical variables, you can use either the indicator matrix or the Burt matrix in the analysis. The indicator matrix contains all the levels of the categorical variables as binary variables (1 = belongs to the category, 0 = does not belong to the category). The Burt matrix is the matrix of all two-way cross-tabulations between the variables in the dataset.
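As a minimal sketch of what these two matrices look like (using a toy two-variable data frame, not the tea data), base R’s model.matrix() can build the indicator matrix, and the Burt matrix is then its cross-product:

```r
# Toy data with two categorical variables (hypothetical, not the tea data)
d <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                sex   = factor(c("F", "M", "F")))

# Indicator matrix: one 0/1 column per category level
# (the contrasts.arg trick keeps every level instead of dropping the first)
Z <- model.matrix(~ . - 1, data = d,
                  contrasts.arg = lapply(d, contrasts, contrasts = FALSE))

# Burt matrix: all two-way cross-tabulations between the category levels
B <- t(Z) %*% Z
```

Each row of Z has exactly one 1 per variable, and the diagonal blocks of B are the one-way frequency tables while the off-diagonal blocks are the cross-tabulations.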

The instructions said that you can do the MCA for all of the columns or just for some of them. Because the tea data has so many observations and variables, I chose to use only some of them in the MCA; I think that makes the results easier to interpret. We are also going to make a summary and a graphical overview.

# Let's choose the columns we want to keep
library(dplyr)
library(tidyverse)
library(FactoMineR)

keep <- c("Tea", "How", "sugar", "sex", "age_Q", "breakfast", "work", "price")

# Next we are creating a new dataset with the selected columns

teal <- tea[, keep]
library(GGally)
library(ggplot2)

# Next we are going to do the summary and the graphical overview of the new data teal
summary(teal)
##         Tea         How           sugar     sex       age_Q   
##  black    : 74   alone:195   No.sugar:155   F:178   15-24:92  
##  Earl Grey:193   lemon: 33   sugar   :145   M:122   25-34:69  
##  green    : 33   milk : 63                          35-44:40  
##                  other:  9                          45-59:61  
##                                                     +60  :38  
##                                                               
##          breakfast         work                 price    
##  breakfast    :144   Not.work:213   p_branded      : 95  
##  Not.breakfast:156   work    : 87   p_cheap        :  7  
##                                     p_private label: 21  
##                                     p_unknown      : 12  
##                                     p_upscale      : 53  
##                                     p_variable     :112
gather(teal) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
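The warning above is harmless here, but gather() has since been superseded in tidyr. An equivalent overview with pivot_longer() (assuming the teal data created above, tidyr 1.2 or newer for a single-function values_transform, and R 4.1 or newer for the native pipe) avoids the warning by converting the factor columns explicitly:

```r
library(tidyr)
library(ggplot2)

# Same bar-chart overview with pivot_longer(); converting the factors to
# character up front avoids the "attributes are not identical" warning
teal |>
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = as.character) |>
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```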

Next, let’s do the multiple correspondence analysis.

mca <- MCA(teal, graph = FALSE)

Summary of the model

summary(mca)
## 
## Call:
## MCA(X = teal, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.219   0.185   0.182   0.162   0.154   0.138
## % of var.              9.737   8.231   8.068   7.191   6.831   6.137
## Cumulative % of var.   9.737  17.968  26.036  33.228  40.058  46.195
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.135   0.129   0.127   0.118   0.109   0.109
## % of var.              6.009   5.712   5.653   5.231   4.866   4.835
## Cumulative % of var.  52.203  57.916  63.568  68.799  73.666  78.501
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.098   0.092   0.089   0.071   0.068   0.065
## % of var.              4.366   4.102   3.947   3.173   3.035   2.876
## Cumulative % of var.  82.867  86.969  90.916  94.089  97.124 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1         |  0.135  0.028  0.004 |  0.235  0.099  0.012 |  0.882  1.428
## 2         |  0.561  0.479  0.162 |  0.751  1.014  0.290 |  0.022  0.001
## 3         | -0.086  0.011  0.005 |  0.591  0.629  0.239 | -0.418  0.321
## 4         | -0.462  0.325  0.192 | -0.469  0.395  0.198 | -0.087  0.014
## 5         |  0.122  0.023  0.011 |  0.262  0.124  0.052 | -0.029  0.002
## 6         | -0.291  0.129  0.033 | -0.279  0.140  0.031 | -0.188  0.065
## 7         |  0.061  0.006  0.002 |  0.309  0.172  0.058 |  0.143  0.038
## 8         |  0.507  0.392  0.115 |  0.556  0.556  0.138 |  0.057  0.006
## 9         |  0.408  0.254  0.069 |  0.340  0.208  0.048 |  0.615  0.695
## 10        |  0.808  0.993  0.280 |  0.220  0.087  0.021 |  0.407  0.304
##             cos2  
## 1          0.163 |
## 2          0.000 |
## 3          0.120 |
## 4          0.007 |
## 5          0.001 |
## 6          0.014 |
## 7          0.012 |
## 8          0.001 |
## 9          0.156 |
## 10         0.071 |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black     |   1.097  16.950   0.394  10.859 |   0.336   1.881   0.037
## Earl Grey |  -0.532  10.397   0.511 -12.360 |   0.056   0.135   0.006
## green     |   0.652   2.666   0.052   3.962 |  -1.080   8.663   0.144
## alone     |  -0.081   0.243   0.012  -1.908 |  -0.127   0.712   0.030
## lemon     |  -0.121   0.092   0.002  -0.736 |  -0.564   2.366   0.039
## milk      |   0.053   0.034   0.001   0.471 |   0.566   4.545   0.085
## other     |   1.829   5.723   0.103   5.561 |   0.866   1.519   0.023
## No.sugar  |   0.462   6.300   0.228   8.265 |   0.403   5.664   0.174
## sugar     |  -0.494   6.734   0.228  -8.265 |  -0.431   6.055   0.174
## F         |  -0.049   0.080   0.003  -1.015 |   0.288   3.323   0.121
##            v.test     Dim.3     ctr    cos2  v.test  
## black       3.326 |   0.337   1.930   0.037   3.335 |
## Earl Grey   1.296 |  -0.085   0.317   0.013  -1.965 |
## green      -6.566 |  -0.261   0.516   0.008  -1.587 |
## alone      -3.002 |  -0.349   5.462   0.227  -8.232 |
## lemon      -3.431 |   0.555   2.333   0.038   3.374 |
## milk        5.048 |   0.781   8.825   0.162   6.965 |
## other       2.634 |   0.066   0.009   0.000   0.200 |
## No.sugar    7.205 |  -0.319   3.620   0.109  -5.703 |
## sugar      -7.205 |   0.341   3.870   0.109   5.703 |
## F           6.016 |  -0.559  12.787   0.457 -11.685 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.526 0.158 0.040 |
## How       | 0.107 0.135 0.241 |
## sugar     | 0.228 0.174 0.109 |
## sex       | 0.003 0.121 0.457 |
## age_Q     | 0.550 0.332 0.313 |
## breakfast | 0.000 0.173 0.055 |
## work      | 0.097 0.325 0.056 |
## price     | 0.241 0.063 0.181 |

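As a sanity check on the eigenvalue table, the “% of var.” and “Cumulative % of var.” rows can be reproduced by hand. For MCA on the indicator matrix, the total inertia is (J − Q)/Q, where J is the total number of category levels (26 for our eight columns) and Q the number of variables (8), giving 2.25; the small differences from the printed values come from the variances above already being rounded.

```r
# Dim.1-Dim.6 variances copied from the summary output above
eig <- c(0.219, 0.185, 0.182, 0.162, 0.154, 0.138)

# Total inertia in MCA: (number of levels - number of variables) / variables
total_inertia <- (26 - 8) / 8          # = 2.25

pct <- 100 * eig / total_inertia       # "% of var." row
round(pct, 3)
round(cumsum(pct), 3)                  # "Cumulative % of var." row
```

This also explains why there are 18 dimensions in total: J − Q = 26 − 8 = 18.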
Visualizing the MCA

plot(mca, invisible = c("ind"), habillage = "quali")

My interpretation of the MCA factor map above: the distance between points gives a measure of their similarity. Younger people (age from 15 to 24 and from 25 to 34) tend to take their tea with sugar and at times other than breakfast, while middle-aged people (age from 35 to 44 and from 45 to 59) prefer to drink their tea without sugar.